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Abstract — This article develops a general detection theory for 
speech analysis based on time-varying autoregressive models, 
which themselves generalize the classical linear predictive speech 
analysis framework. This theory leads to a computationally 
efficient decision-theoretic procedure that may be applied to 
detect the presence of vocal tract variation in speech waveform 
data. A corresponding generalized likelihood ratio test is derived 
and studied both empirically for short data records, using 
formant-like synthetic examples, and asymptotically, leading to 
constant false alarm rate hypothesis tests for changes in vocal 
tract configuration. Two in-depth case studies then serve to 
illustrate the practical efficacy of this procedure across different 
time scales of speech dynamics: first, the detection of formant 
changes on the scale of tens of milliseconds of data, and second, 
the identification of glottal opening and closing instants on time 
scales below ten milliseconds. 

Index Terms — Glottal airflow, likelihood ratio test, linear 
prediction, nonstationary time series, vocal tract variation 



I. Introduction 

THIS article presents a statistical detection framework 
for identifying vocal tract dynamics in speech data 
across different time scales. Since the source-filter view of 
speech production motivates modeling a stationary vocal tract 
using the standard linear-predictive or autoregressive (AR) 
model (2), it is natural to represent temporal variation in 
the vocal tract using a time-varying autoregressive (TVAR) 
process. Consequently, we propose here to detect vocal tract 
changes via a generalized likelihood ratio test (GLRT) to 
determine whether an AR or TVAR model is most appropriate 
for a given speech data segment. Our main methodological 
contribution is to derive this test and describe its asymp- 
totic behavior. Our contribution to speech analysis is then 
to consider two specific, in-depth case studies of this testing 
framework: detecting change in speech spectra, and detecting 
glottal opening and closing instants from waveform data. 

Earlier work in this direction began with the fitting of 
piecewise-constant AR models to test for nonstationarity B), 
Q. However, in reality, the vocal tract often varies slowly, 
rather than as a sequence of abrupt jumps; to this end, (5J - © 
studied time-varying linear prediction using TVAR models. In 
a more general setting, Kay |9} recently proposed a version of 
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the Rao test for AR vs. TVAR determination; however, when 
available, likelihood ratio tests often outperform their Rao 
test counterparts for finite sample sizes fTO}. Nonparametric 
approaches to detecting spectral change in acoustic signals 



were proposed by the current authors in 1 11 1, 1 12 1. 

Detecting spectral variation across multiple scales is an 
important first step toward appropriately exploiting vocal tract 
dynamics. This can lead to improved speech analysis algo- 
rithms on time scales on the order of tens of milliseconds 
for speech enhancement ]TT) , p3)-fT5), classification of time- 
varying phonemes such as unvoiced stop consonants |7|, and 
forensic voice comparison |16|. At the sub-segmental time 
scale (i.e., less than one pitch period), sliding-window AR 
analysis has been used to capture vocal tract variation and to 
study the excitation waveform as a key first step in applications 
including inverse filtering [17], speaker identification |18|, 
synthesis |19|, and clinical voice assessment J20) . 

In the first part of this article, we develop a general 
detection theory for speech analysis based on TVAR models. 
In Section [II] we formally introduce these models, derive their 
corresponding maximum-likelihood estimators, and develop a 
GLRT appropriate for speech waveforms. After providing ex- 
amples using real and synthetic data, including an analysis of 
vowels and diphthongs from the TIMIT database pT| , we then 
formulate in Section[lIl]a constant false alarm rate (CFAR) test 
and characterize its asymptotic behavior. In Section IV we 



discuss the relationship of our framework to classical methods, 
including the piecewise-constant AR approach of pj. 

Next, we consider two prototype speech analysis applica- 
tions: in Section [V] we apply our GLRT framework to detect 
formant changes in both whispered and voiced speech. We 
then show how to detect glottal opening and closing instants 
via the GLRT in Section [Vl] We evaluate our results on 
the more difficult problem of detecting glottal openings p2| 
using ground-truth data obtained by electroglottograph (EGG) 
analysis, and also show performance comparable to methods 
based on linear prediction and group delay for the task of 
identifying glottal closures. We conclude and briefly discuss 
future directions in Section IVlTl 



II. Time-Varying Autoregressions and Testing 

A. Model Specification 

Recall the classical pth-order linear predictive model for 
speech, also known as an AR(p) autoregression |2): 

p 

AR(p): x[n] = aix[n — i] + aw[n], (1) 

i=l 

where the sequence w[n] is a zero-mean white Gaussian 
process with unit variance, scaled by a gain parameter a > 0. 
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A more flexible pth-order time-varying autoregressive model 
is given by the following discrete-time difference equation |5 1: 



TVAR(p): 



x\n 



i=l 



ai[n]x[n — i] + <ra/[n] 



(2) 



In contrast to ([TJ, the linear prediction coefficients a% [n] of (T5J 
are time-dependent, implying a nonstationary random process. 

The model of (T5J requires specification of precisely how the 
linear prediction coefficients evolve in time. Here we choose 
to expand them in a set of q + 1 basis functions fj [n] weighed 
by coefficients o^- as follows: 



3=0 



fj[n], for all 1 < i < p. 



(3) 



We assume throughout that the "constant" function fo [n] = 1 
is included in the chosen basis set, so that the classical AR(p) 
model of ([1} is recovered as a, = a iQ ■ 1 whenever = 
for all j > 0. Many choices are possible for the functions 
fj[n] — Legendre p3| and Fourier |5J polynomials, discrete 
prolate spheroidal functions [6|, and even wavelets [24] have 
been used in speech applications. 

The functional expansion of (T3J was first studied in (23), 
p5) , and subsequently applied to speech analysis by (5]-j[7j, 
among others. Coefficient trajectories a,i[n] have also been 
modeled as sample paths of a suitably chosen stochastic pro- 
cess (see, e.g., [26 1). In this case, however, estimation typically 
requires stochastic filtering fT3) or iterative methods J27) in 
contrast to the least-squares estimators available for the model 
of Q, which are described in Section |Tl-C| below. 

B. AR vs. TVAR Generalized Likelihood Ratio Test ( GLRT) 

We now describe how to test the hypothesis Ho that a 
given signal segment x = (x[0] x[l] ■ ■ ■ x[N— 1]) T has been 
generated by an AR(p) process according to ([TJ, against the 
alternative hypothesis Hi of a TVAR(p) process as specified 
by |2j and ([3j above. We introduce a GLRT to examine 
evidence of change in linear prediction coefficients over time, 
and consequently in the vocal tract resonances that they 
represent in the classical source-filter model of speech. 

According to the functional expansion of |3j, the TVAR(p) 
model of (|2j is fully described by p(q + 1) expansion coeffi- 
cients aiij and the gain term a. For convenience we group the 
coefficients otij into q + 1 vectors a,-, < j < q, as 

a ■ = (aij a 2j ■■■ a pj ) T . 

We may then partition a vector a £ IRp( < ?+ 1 ) x1 into blocks 
associated to the AR(p) portion of the model char, and the 
remainder a TV , which captures time variation: 

«-(«ar I a Tv) T =(«o I "T a l ■■■ a q) T ■ ( 4 ) 
Recalling that the TVAR(p) model (hypothesis Hi) reduces 
to an AR(p) model (hypothesis Ho) precisely when atj = 
for all j > 0, we may formulate the following hypothesis test: 

Model : TVAR(p) with parameters a, a 2 ; 

Ho : a.j — for all j > 0, 



Hypotheses : 



Hi : olj ■ ^ for at least one j > 0. 



(5) 



Estimate a... a.. 



Estimate a 2 
under H. 



T(») 



Estimate a AI 
under H n 



Estimate a 2 
under H n 



(«-p)ln(o-;,) 



Fig. 1. Computation of the GLRT statistic T(x) according to Section [lI-C| 

Each of these two hypotheses in turn induces a data likelihood 
in the observed signal x £ M Arxl , which we denote by p-Hi(') 
for i — 1,2. The corresponding generalized likelihood ratio 
test comprises evaluation of a test statistic T(x), and rejection 
of Ho in favor of Hi if T(x) exceeds a given threshold 7: 



T{x) = 2 In 



Hi 

^ 7- 



(6) 



C. Evaluation of the GLRT Statistic 

The numerator and denominator of |6| respectively im- 
ply maximum-likelihood (ML) parameter estimates of a = 
(aJ^ R I £*xv) T anc l a o m HJ under the specified TVAR(p) 
and AR(p) models, along with their respective gain terms a 2 . 
Intuitively, when Ho is in force, estimates of cctv will be 
small; we formalize this notion in Section |III-A| by showing 
how to set the test threshold 7 to achieve a constant false alarm 
rate. 

As we now show, conditional ML estimates are easily 
obtained in closed form, and terms in |6j reduce to estimates 
of a 2 under hypotheses Ho and Hi, respectively. Given N 
observations, partitioned according to 

\T 



X X]\[— p) 



T A 



;[0] ■•• x\p- 1] I x[p] ■■■ x[N- 1] 



the joint probability density function of a, a 2 is given by: 



p(x ; a., a 2 ) = p(x N - 



a,a 2 )p(x p ; a, a 2 ). (7) 



Here the notation | reflects conditioning on random variables, 
whereas ; indicates dependence of the density on non-random 
parameters. As is standard practice, we approximate the un- 
conditional data likelihood of (TTJ by the conditional likelihood 
p(xn- p \x p ; a, a 2 ), whose maximization yields an estimator 
that converges to the exact (unconditional) ML estimator as 
N — >• 00 (see, e.g., p8| for this argument under Ho)- 
Gaussianity of w[n) implies the conditional likelihood 



(x N - 



(2 7 ra 2 )( JV -f)/ 2 



exp 



E 



2a 2 



where e[n] = x[n] - Y%=i Z)'=o a ijfA n ] x [ n ~ *] is me 
associated prediction error. The log-likelihood is therefore 



\np(x 



N-p 



2\ 

x p ; a, a ) = — 



N-p 



ln(27T(T ) - 



x N - 



H-a|| 



2a 2 



(8) 

where the (n-p+l)th row of the matrix H x E r( w -p) x p(<?+ 1 ) 
is given by the Kronecker product (x[n — 1] ■ ■ ■ x[n — p]) (g> 
(fo[n] fi[n] ■ ■ ■ f q [n}) for any p < n < N - 1. 
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Fig. 2. Example of GLRT detection performance for a "formant-like" synthetic TVAR(2) signal. Left: A test signal and its TVAR coefficients are shown at 
top, with pole trajectories and AR vs. TVAR estimates below. Right: Operating characteristics of the corresponding GLRT (p = 2 TVAR coefficients, q = 4 
Legendre polynomials, f s = 1 6 kHz) shown for various frequency jumps and data lengths. 



Maximizing ([8]) with respect to a therefore yields the least- 
squares solution of the following linear regression problem: 



H r a + aw, 



(9) 



where w = (w[p) ■ ■ ■ w[N — 1]) T . Consequently, the condi- 
tional ML estimate of a. follows from ([8} and |9]) as 

a=(H^H x y 1 Hlx N _ p . (10) 

The estimator of ( fT0| ) corresponds to a generalization of the 
covariance method of linear prediction — to which it exactly 
reduces when the number q of non-constant basis functions 
employed is set to zero (5); we discuss the corresponding 



generalization of the autocorrelation method in Section IV-B 



The conditional ML estimate of a 2 is obtained by substitut- 
ing (JlO]) into ((8) and maximizing with respect to a 2 , yielding 



N-p 



N-l, 

El 



p i 

t=lj=0 



otijfj[n}x[n}x[n-i\ . (11) 



Under Ho (the time-invariant case), the estimator of ( JTTj ) 
reduces to the familiar a 2 = r xx [0] — Y^i=i a i0^xx[i]> where 
r M [r] is the autocorrelation function of x\n\ at lag r. 

In summary, the conditional ML estimates of ckar, ctrv and 
a 2 under Hi are obtained using ( fT0) i and ( fTTj ), respectively. 
Estimates of c*ar and a 2 under Ho are obtained by setting 
q = in ( [Tol l and ( fTT| . Substituting these estimates into the 
GLRT statistic of we recover the following intuitive form 
for T(x), whose computation is illustrated in Fig. [TJ 



T(x) = {N-p)ln(<7 2 no /a 2 ni 



(12) 



D. Evaluation of GLRT Detection Performance 

To demonstrate typical GLRT behavior, we first consider 
an example detection scenario involving a "formant-like" 
signal synthesized by filtering white Gaussian noise through 



a second-order digital resonator. The resonator's center fre- 
quency is increased by 6 radians halfway through the duration 
of the signal, while its bandwidth is kept constant; an example 
560-sample signal with S = 77r/80 radians is shown in Fig. [5] 

Detection performance in this setting is summarized in the 
right-hand panel of Fig. [2j which shows receiver operating 
characteristic (ROC) curves for different signal lengths N 
and frequency jump sizes S. These were varied in the ranges 
N e {80,240,400,560} samples (10 ms increments) and 
8 € {7r/80,37r/80,57r/80,77r/80} radians (200 Hz incre- 
ments), and 1000 trial simulations were performed for each 
combination. To generate data under Hq, S was set to zero. In 
agreement with our intuition, detection performance improves 
when 5 is increased while N is fixed, and vice versa — simply 
put, larger changes and those occurring over longer intervals 
are easier to detect. Moreover, even though the span of the 
chosen Legendre polynomials does not include the actual 
piecewise-constant coefficient trajectories, the norm of their 
projection onto this basis set is sufficiently large to trigger a 
detection with high probability. 

We next consider a large-scale experiment designed to 
test the sensitivity of the test statistic T(x) to vocal tract 
variation in real speech data. To this end, we fitted AR(10) 
and TVAR(10) models (with q = 4 Legendre polynomials) 
to all instances of the vowels /eh/, /ih/, /ae/, /ah/, /uh/, /ax/ 
(as in "bet,' "bit," "bat," "but," "book," and "about") and the 
diphthongs /ow/, /oy/, /ay/, /ey/, /aw/, /er/ (as in "boat," "boy," 
"bite," "bait," "bout," and "bird") in the training portion of the 
TIMIT database pTJ. Data were downsampled to 8 kHz, and 
values of T(x) were averaged across all dialects, speakers, 
and sentences (50, 000 vowel and 25, 000 diphthong instances 
in total). 

Per-phoneme averages are reported in Table [TJ and indicate 
considerably stronger detections of vocal tract variation in 
diphthongs than in vowels — and indeed a two-sample t test 
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TABLE I 

Vocal Tract Variation in TIMIT Vowels & Diphthongs. 



Vowel 


eh 


ih 


ae 


ah 


uh 


ax 


T{ X ) 


67.5 


60.5 


94.6 


63.8 


58.9 


32.1 


Diphthong 


ow 


oy 


ay 


ey 


aw 


er 


T( X ) 


134.1 


302.4 


187.4 


130.6 


161.6 


133.0 



easily rejects (p-value ~ 0) the hypothesis that the average 
values of T(x) for the two groups are equal. This finding 
is consistent with the physiology of speech production, and 
demonstrates the sensitivity of the GLRT in practice. 

III. Analysis of Detection Performance 

To apply the hypothesis test of |5]), it is necessary to select 
a threshold 7 as per |6]), such that the null hypothesis of a 
best-fit AR(p) model is rejected in favor of the fitted TVAR(p) 
model whenever T(x) > 7. Below we describe how to choose 
7 to guarantee a constant false alarm rate (CFAR) for large 
sample sizes, and give the asymptotic (in N) distribution of the 
GLRT statistic under Ho and Hi, showing how these results 
yield practical consequences for speech analysis. 

A. Derivation of GLRT Asymptotic s and CFAR Test 

Under suitable technical conditions (29|, likelihood ratio 
statistics take on a chi-squared distribution xi(0) as the sample 
size N grows large whenever Ho is in force, with the degrees 
of freedom d equal to the number of parameters restricted 
under the null hypothesis. In our setting, d = pq since the pq 
coefficients «tv are restricted to be zero under Ho, and we 
may write that T(x) ~ Xp 9 (0) under Ho as N —> 00. 

Thus, we may specify an allowable asymptotic constant 
false alarm rate for the GLRT of Q, defined as follows: 

iv lim o Pr{T( a; )>7;W } = Pr{x^(0)>7}- (13) 

Since the asymptotic distribution of T(x) under Ho depends 
only on p and q, which are set in advance, we can determine a 
CFAR threshold 7 by fixing a desired value (say, 5%) for the 
right-hand side of ( fT3| ), and evaluating the inverse cumulative 
distribution function of Xpq(0) to obtain the value of 7 that 
guarantees the specified (asymptotic) constant false alarm rate. 

When x is a TVAR process so that the alternate hypothesis 
Hi is in force, T(x) instead takes on (as N —> 00) a 
noncentral chi-squared distribution XdW- l ts noncentrality 
parameter A > depends on the true but unknown parameters 
of the model under Hi, thus in general 



T(x) 



~ X„„(A), 



A = under Ho, 
A>0 under "Hi. 



(14) 



It is easily shown by the method of [9| that the expression for 
A in the case at hand is given by 



A = ol^{F t F ® a- 2 R)a TV , 



(15) 



where ■ denotes the Schur complement with respect to 
the first p x p matrix block of its argument, the (J + l)th 



column of the matrix 

UM fj\p + i] ■■■ 
I 



R 



F e 

fjW- 

Txx [0] 



p(JV-p)x(g+l) 



is given by 



1]) , and R is given by: 



j[p- 

;[P- 



2] 



\r X x [p - 1] r xx [p - 2] 



zx[0] J 



Here {r X!C [0], r xx [l], . . . , r xx [p - 1]} is the autocorrelation 
sequence corresponding to c*ar (given, e.g., by the "step-down 
algorithm" J28) ). The expression of (jT3J follows from the 
fact that F T F <g) a~ 2 R is the Fisher information matrix for 
our TVAR(p) model; its Schur complement arises from the 
composite form of our hypothesis test, since the parameters 
c*ar, o -2 are unrestricted under Ho- 

More generally, we may relate this result to the underlying 
TVAR coefficient trajectories a,i[n], arranged as columns of 
a matrix A, with each column-wise mean trajectory value a 
corresponding entry in a matrix A. Letting A = A— A denote 
the centered columns of A, and noting both that F T F ® R = 
F T F ® R and that F(F T F)~ 1 F T A = A when Hi is in 
force, properties of Kronecker products pO) can be used to 
show that ( fT3] > may be written as 

- 2 tr( ARA T ). 



A 



(16) 



Thus A depends on the centered columns of A, which contain 
the true but unknown coefficient trajectories aj[n] minus their 
respective mean values. 

B. Model Order Selection 

The above results yield not only a practical CFAR 
threshold-setting procedure, but also a full asymptotic descrip- 
tion of the GLRT statistic of |6]l under both Ho and "Hi. In 
light of this analysis, it is natural to ask how the TVAR model 
order p should be chosen in practice, along with the number 
q of non-constant basis functions. In deference to the large 
literature on the former subject p), we adopt here the standard 
"2 coefficients per 1 kHz of speech bandwidth" rule of thumb. 

Intuitively, the choice of basis functions should be well 
matched to the expected characteristics of the coefficient tra- 
jectories a,i[n]. To make this notion quantitatively precise, we 
appeal to the results of (fT4]>— (|T6|> as follows. First, the statis- 
tical power of our test to successfully detect small departures 
from stationarity is measured by the quantity Pr {XdW > l}- 
A result of [31] then shows that for fixed 7, the power function 
Pr {Xd( A ) > 7} is: 

1) Strictly monotonically increasing in A, for fixed d; 

2) Strictly monotonically decreasing in d for fixed A. 

Each of these properties in turn yields a direct and important 
consequence for speech analysis: 

• Test power is maximized when A attains its largest value: 
For fixed p and q, the noncentrality parameter A of ( fTo*] ) 
determines the power of the test as a function of a 1 and 
the true but unknown coefficient trajectories A. 

• Overfitting the data reduces test power: Choosing p or 
q to be larger than the true data-generating model will 
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Effect of Model Order on Detection Performance 
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Fig. 3. The effect of overfitting on the detection performance of the GLRT 
statistic for the synthetic signal of Fig. [2] An increase in the model order — p 
(left) and q (right) — decreases the probability of detection at any CFAR level. 

result in a quantifiable loss in power, as A will remain 
fixed while the degrees of freedom increase. 

The first of these consequences follows from Property [T] 
above, and reveals how test power depends on the energy 
of the centered TVAR trajectories A = A — A for fixed 
A and p,q,a 2 . To verify the second consequence, observe 
that the product ARA T remains unaffected by an increase in 
either p or q beyond that of the true TVAR(p) model. Then 
by Property [2] the corresponding increase in the degrees of 
freedom pq will lead to a loss of test power. 

This analysis implies that care should be taken to ade- 
quately capture the energy of TVAR coefficient trajectories 
while guarding against overfitting; this formalizes our ear- 
lier intuition and reinforces the importance of choosing a 
relatively low-dimensional subspace formed by the span of 
low-frequency basis functions whose degree of smoothness 
is matched to the expected TVAR(p) signal characteristics 
under "Hi. This conclusion is further illustrated in Fig. [5] 
which considers the effects of overfitting on the "formant-like" 



synthetic example of Section II-D with p = 2, N = 100 sam- 
ples, S = 77r/80 radians, and piecewise-constant coefficient 
trajectories. Not only is the effect of overfitting p apparent in 
the left-hand panel, but the detection performance also suffers 
as the degree q of the Legendre polynomial basis is increased, 
as shown in the right-hand panel. 

IV. Relationship to Classical Approaches 

We now relate our hypothesis testing framework to two 
classical approaches in the literature. First, we compare its 
performance to that of Brandt's test |3j, which has seen wide 
use both in earlier Q, (32| and more recent studies p3] , 
p3j , p4| , for purposes of transient detection and automatic 
segmentation for speech recognition and synthesis. Second, 
we demonstrate its advantages relative to the autocorrelation 
method of time-varying linear prediction [5], showing that data 
windowing can adversely affect detection performance in this 
nonstationary setting. 

A. Classical Piecewise-Constant AR Approach 

A related previous approach is to model x as an AR 
process with piecewise-constant parameters that can undergo 
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Fig. 4. Comparing the detection performance of the statistic of j6} and that 
of {T7J: (top-left) compa rison using the piecewise-constant signal (N = 100, 
<5 = 5tt/80) of Section |II-D| with p = 2 and q = 2 Legendre polynomials 
used for computing j6j; (top-right) piecewise-linear center frequency of the 
digital resonator used to generate the 2nd synthetic example; (bottom-left) 
comparison using the piecewise-linear signal (N = 300) with p = 2 and q = 
3 Legendre polynomials used for computing j6j; (bottom-right) histogram of 
the changepoint r that maximizes the test statistic T^,(x) for each instantiation 
of the signal with piecewise-linear TVAR coefficient trajectories. 

at most a single change [35|. The essence of this ap- 
proach, first employed in the speech setting by |3J, is to 
split x into two parts according to x = (x r | Xjv- r ) = 
(x[0] ■ ■ ■ x[r — 1] | x[r] ■ ■ ■ x[N — 1]) T for some fixed r, and 
to assume that under Hq, x is modeled by an AR(p) process 
with parameters an, whereas under Hi, x r and x^^ r are 
described by distinct AR(p) processes with parameters a r and 
ajv-r, respectively. 

In this context, testing for change in AR parameters at 
some known r can be realized as a likelihood ratio test; the 
associated test statistic T' r (x) is obtained by applying the 
covariance method to x, x r , and x^_ r in order to estimate 
an, a r , and otN- r , respectively. However, since the value of 
r is unknown in practice, T' r {x) must also be maximized over 
r, yielding a test statistic T'(x) as follows: 



T'(x) = maxT^(s) 2p < r < N - 2p, with 



t;,(x) 4 



SUp p Hl (x r ;a r )pn 1 (xN-r,OtN-r) 

snpp-H (x;a ) 

ota 



(17) 



(18) 



We compared the detection performance of the GLRT 
statistic of (|6]l with that of ( fl7] i on both the piecewise-consta«f 
signal of Fig. [2] and a piecewise-Z/near TVAR(2) signal to 
illustrate their respective behaviors — the resulting ROC curves 
are shown in Fig.H] In both cases, it is evident that the TVAR- 
based statistic of has more power than that of ( fTTj l, in part 
due to the extra variability introduced by maximizing over all 
values of r in ( fT7j ) — especially those near the boundaries of 
its range. Even in the case of the piecewise-constant signal, 
correctly matched to the assumptions underlying ( fT7j ), the 
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TVAR-based test is outperformed only when r is known a 
priori, and ([18} is used. This effect is particularly acute in the 
small sample size setting — an important consideration for the 



single-pitch-period case study of Section VI 

This example demonstrates that any estimates of r can be 
misleading under model mismatch. As shown in the bottom- 
right panel of Fig. |4] the detected changepoint is often esti- 
mated to be near the start or end of the data segment, but 
no "true" changepoint exists since the time-varying center 
frequency is continuously changing. Thus piecewise-constant 
models are only simple approximations to potentially complex 
TVAR coefficient dynamics; in contrast, flexibility in the 
choice of basis functions implies applicability to a broader 
class of time-varying signals. 

Note also that computing ( flT) ) requires brute-force evalua- 
tion of ( p"8j ) for all values of r, whereas (|6]l need be calculated 
once. Moreover, T'(x) fails to yield chi-squared (or any 
closed-form) asymptotics |35) , thus precluding the design of 
a CFAR test and any quantitative evaluation of test power. 

B. Classical Linear Prediction and Windowing 

Recall that our GLRT formulation of Section [TTJ, stemming 
from the TVAR model of Q, generalized the covariance 
method of linear prediction to the time-varying setting. The 
classical autocorrelation method also yields least-squares esti- 
mators, but under a different error minimization criterion than 
that corresponding to conditional maximum likelihood. To see 
this, consider the TVAR model 



x\n 



p 

E< 

i=l 



- aw n\. 



(19) 



in lieu of |2]). Grouping the coefficients ay into p vectors = 
(ctio an ■ ■ ■ onq) T , 1 < i < p, induces a partition of the 



~T\T 



expansion coefficients given by a = [a\ ctj 
a permutation of elements of a in Q. The autocorrelation 
estimator of a is then obtained by minimizing the prediction 
error over all n £ Z, while assuming that x[n] = for all n ^ 
[0, . . . , N — 1], and is equivalent to the least-squares solution 
of the following linear regression problem: 



x = H x ol + aw. 



(20) 



where w = (w[Q] ■■■ w[N — 1]) T ) and the nth row of 
H x G ]R Arx P(«+ 1 ) is given by (f [n - l]x[n - 1] • • • f [n - 
p]x[n— p] ■■■ f q [n — l]x[n — 1] ••• f q [n — p]x[n —p\). The 
autocorrelation estimate of a then follows from ( p0| > asQ 

§ = {HlH^Hlx. (21) 

Moreover, when the autocorrelation method is used for 
spectral estimation in the stationary setting, x is often pre- 
multiplied by a smooth window. To empirically examine 
the role of data windowing in the time-varying setting, we 
generated a short 196-sample synthetic TVAR(2) signal x 
■ — i , — — 

'As noted by 151, H/ f . is a block-Toeplitz matrix comprised of p 2 
symmetric blocks of size (q + 1) X (q + 1) — this special structure arises as a 
direct consequence of the synchronous form of the TVAR trajectories in jT9j. 
ThuSjJhe multichannel Levinson-Durbin recursion [ 361 may be used to invert 
H^H^ directly. 
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Fig. 5. Comparison of covariance- and autocorrelation-based test statistics, 
based on 5000 trials with a short (196-sample) data record. Top: ROC curves 
showing the effects of data windowing on detection performance. Bottom: 
Detail of how windowing changes the distribution of T(x) under "Ho- 
using q — (Ho) and q = 2 (Hi) non-constant Legendre 
polynomials, and fitted x using AR(3) and TVAR(3) models — 
with the extra autoregressive order expected to capture the 
effects of data windowing. We then generated an ROC curve 
associated with the GLRT statistic of ( fL?} , shown in the top 
panel of Fig. E\ along with ROC curves corresponding to an 
evaluation of (12]> following the autocorrelation — rather than 
the covariance — method, both with and without windowing. 

The bottom panel of Fig.[5]shows the empirical distributions 
of both autocorrelation-based test statistics under Ha, and 
indicates how windowing has the inadvertent effect of hinder- 
ing detection performance in this setting. We have observed 
the effects of Fig. [5] to be magnified for even shorter data 
records, implying greater precision of the covariance-based 
GLRT approach, which also has the advantage of known test 
statistic asymptotics under correct model specification. 

V. Case Study I: Detecting Formant Motion 

We now introduce a GLRT-based sequential detection al- 
gorithm to identify vocal tract variation on the scale of tens 
of milliseconds of speech data, and undertake a more refined 
analysis than that of Section |II-D| to demonstrate its efficacy 
on both whispered and voiced speech. Our results yield strong 
empirical evidence that appropriately specified TVAR models 
can capture vocal tract dynamics, just as AR models are known 
to provide a time-invariant vocal tract representation that is 
robust to glottal excitation type. 

A. Sequential Change Detection Scheme 

Our basic approach is to divide the waveform into a 
sequence of K short-time segments {xx, X2, • • • , Xk} using 
shifts of a single iVo-sample rectangular window, and then to 
merge these segments, from left to right, until spectral change 
is detected via the GLRT statistic of (J51. The procedure, 
detailed in Algorithm [T] begins by merging the first pair of 
adjacent short- time segments X\ and X2 into a longer segment 
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Algorithm 1 Sequential Formant Change Detector 



(a) Change Detection with 1 st Formant Only 



1) Initialization: set 7 via (13 1, input waveform data x 

• Compute K short- time segments {x%, . . . , Xk} of 
x using shifts of a rectangular window 

• Set k = 1, x\ = x\, x r = x 2 

• Set a marker array C[k] =0 for all 1 < k < K 

2) While k < K 

• Set x m = X[ + x r and compute T(x m ) via |6]) 

• If T(x m ) < 7 (no formant motion within x m ) 

- Set X[ = x m , C[k] — 

Else (formant motion detected within x m ) 

- Set xi = Xk, C[k] = 1 

• Set x r = ajfc+i, k = k + 1 

3) Return the set of markers {k : C[k] = 1} 



x m and computing T(x m ); failure to reject Ho implies that 
x m is stationary. Thus, the short-time segments remain merged 
and the next pair considered is (x mi x^). This procedure 
continues until Ho is rejected, indicating the presence of 
change within the merged segment under consideration. In 
this case, the scheme is re-initialized, and adjacent short-time 
segments are once again merged until a subsequent change in 
the spectrum is detected. 

In Algorithm[T] the CFAR threshold 7 of ( fT3~| is set prior to 
observing any data, by appealing to the asymptotic distribution 
of T(x) under Ho developed in Section 



III-A 



In principle, 

the time resolution to within which change can be detected is 
limited only by Nq. Using arbitrarily short windows, however, 
increases the variance of the test statistic and results in 
an increase in false alarms — a manifestation of the Fourier 
uncertainty principle. Decreasing 7 also serves to increase the 
(constant) false alarm rate, and leads to spurious labeling of 
local fluctuations in the estimated coefficient trajectories (e.g., 
due to the position of the sliding window relative to glottal 
closures) as vocal tract variation. 

B. Evaluation with Whispered Speech 

In order to evaluate the GLRT in a gradually more realistic 
setting, we first consider the case of whispered speech to avoid 
the effects of voicing, and apply the formant change detection 
scheme of Algorithm [T] to whispered utterances containing 
slowly-varying and rapidly-varying spectra, respectively. 

The waveform used in the first experiment comprises a 
whispered vowel /a/ (as in "father") followed by a diphthong 
/al/ (as in "liar"). It was downsampled to 4 kHz in order to 
focus on changes in the first two formants, and Algorithm [T] 
was applied to this waveform as well as to its 0-1 kHz and 
1-2 kHz subbands (containing the first and second formants, 
respectively). 

Results are summarized in Fig. [6] and clearly demonstrate 
that the GLRT is sensitive to formant motion. All three 
spectrograms indicate that spectral change is first detected near 
the boundary of the vowel and diphthong — precisely when the 
vocal tract configuration starts to change. Subsequent consec- 
utive changes are found when sufficient formant change has 
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(d) Time-domain waveform 
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Fig. 6. Result of applying Algorithm[T](16 ms rectangular windows, p = 4, 
q = 2 Legendre polynomials, 1% CFAR) to detect formant movement in 
the whispered waveform /a al/. Spectrograms corresponding to subbands 
containing the first formant only (a) second formant only (b) and both formants 
(c) were computed using 16 ms Hamming windows with 50% overlap, and 
are overlaid with formant tracks computed by WaveSurfer [37]. Black lines 
demarcate times at which formant motion was detected; the time-domain 
waveform overlaid with these boundaries is also shown (d). 

been observed relative to data duration — a finding consistent 
with our earlier observation in Section III-DI that more data 
are required to detect small changes in the AR coefficient 
trajectories, and by proxy the vocal tract, at the same level 
of statistical significance (i.e., same false alarm rate). 

Next observe that whereas three "changepoints" are found 
when the waveform contains two moving resonances, a total 
of three "changepoints" are marked in the single-resonance 
waveforms shown in Figs. |6|a) and |6|b). Intuitively, each 
of these signals can be thought of as having "less" spectral 
change than the waveform shown in Fig. |6jc), which contains 
both formants. Thus, since the corresponding amounts of 
spectral change are smaller, longer short-time segments are 
required to detect formant movement — as indicated by the 
delays in detecting the vowel-diphthong transition seen in 
Figs. |6|a) and (b) relative to (c). 

We next conducted a second experiment to demonstrate that 
the GLRT can also detect a more rapid onset of spectral change 
as compared to, e.g., the relatively slow change in the spectrum 
of the diphthong. To this end we applied Algorithm [T] to a 
sustained whispered vowel (/i/ as in "beet"), followed by the 
plosive /t/ at 10 kHz. The results, shown in Fig. [7] indicate 
that no change is detected during the sustained vowel, whereas 
the plosive is clearly identified. 

Finally, we have observed change detection results such as 
these to be robust to not only reasonable choices of p (roughly 
2 coefficients per 1 kHz of speech bandwidth) and q (1-10), 
but also to the size of the initial window length (10-40 ms), 
and the constant false alarm rate (1-20%). 
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Spectrogram 




Spectrogram 
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Fig. 7. Algorithm[T](16 ms windows, p = 10, q = 4 Legendre polynomials, 
1% CFAR), applied to detect formant movement in the whispered waveform 
/i Xl. Its spectrogram (top) is overlaid with formant tracks computed by 
WaveSurfer |37| and black lines demarcating the time instants at which 
formant motion was detected; the time-domain signal is also shown (bottom). 



C. Extension to Voiced Speech 

We next conducted an experiment to show that the TVAR- 
based GLRT is robust to the presence of voicing. We repeated 
the first experiment of Section |V-B| above using a voiced 
vowel-diphthong pair /a al/ over the range 0-4 kHz. The same 
parameter settings were employed, except for the addition of 
two poles to take into account the shape of the glottal pulse 
during voicing |2|. Algorithm [T] yields the results shown in 
Fig. 8(a) which parallel those shown in Fig. [6] for the whis- 
pered case. Indeed, the first change occurs at approximately the 
vowel-diphthong boundary, with subsequent "changepoints" 
marked when sufficient formant movement has been observed. 

The similarities in these results are due in part to the 
fact that the analysis windows employed in both cases span 
at least one pitch period. To wit, consider the synthesized 
voiced phoneme /a/ and the associated GLRT statistic of |6]) 
shown in the top and bottom panels of Fig. 8(b) respectively. 



Even though the formants of the synthesized phoneme are 
constant, the value of T(x) undergoes a stepwise decrease 
from over the 1% CFAR threshold when < 1 pitch period is 
observed, to just above the 50% CFAR threshold when < 1.5 
periods are observed — and finally stabilizes to a level below 
the 50% CFAR threshold after more than two periods are seen. 
In contrast, the GLRT statistic computed for the associated 
whispered phoneme, generated by filtering white noise by a 
vocal tract parameterized by the same formant values and 
shown in the bottom panel of Fig. 8(b) remains time-invariant. 



These results indicate that the periodic excitation during 
voicing has negligible impact on the GLRT statistic when 
longer (i.e., > 2 pitch periods) speech segments are used, 
and explain the robustness of the GLRT statistic T(x) to the 
presence of voicing in the experiments of this section. On the 
other hand, the GLRT is sensitive to the glottal flow when 
shorter speech segments are employed, suggesting that it can 
be also used effectively on sub-segmental time scales, as we 
show in Section [vTl 

VI. Case Study 2: Sub-Segmental Speech Analysis 

We now demonstrate that our GLRT framework can be 
used not only to detect formant motion across multiple pitch 
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(a) Spectrogram of the voiced waveform /a al/ is overlaid with formant 
tracks computed by WaveSurfer 1 37] and black lines demarcating the time 
instants at which formant motion was detected; the time domain signal is 
shown for reference (bottom). 



Synthetic Voiced Phoneme Segment 




Test Statistics 




p' 30 - 1%CFAR 
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(b) Formant-synthesized voiced phoneme IsJ (top) and associated GLRT 
statistic (bottom, green) are shown along with 1% (solid black) and 50% 
CFAR (dashed-black) thresholds. Window lengths of 5 — 35 ms at 1 ms 
(16-sample) increments with p = 6, q = 3 Legendre polynomials were 
used to calculate T(x). Values of T(x) for a whispered /a/ (bottom, blue) 
generated using the same formant values are shown for comparison. 

Fig. 8. Detecting vocal tract dynamics in voiced speech (a) and the impact 
of the quasi-periodic glottal flow on the GLRT statistic T(x) (b). 



periods, as discussed above in Section [V] but also to detect 
vocal tract variations within individual pitch periods. Since 
the vocal tract configuration is relatively constant during the 
glottal airflow closed phase, and undergoes change at its 
boundaries p8) , a hypothesis test for vocal tract variation 
provides a natural way to identify both glottal opening and 
closing instants within the same framework. 

We show below that this framework is especially well 
suited to detecting the gradual change associated with glottal 
openings, and can also be used to successfully detect glottal 
closures. Glottal closure identification is a classical problem 
(see, e.g., p9| for a recent review), with mature engineering 
solutions typically based on features of the linear prediction 
residual or the group delay function (see, e.g., p7[ , fl9) , 
p2| and references therein). In contrast, the slow onset of the 
open phase results in a difficult detection problem, and glottal 
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Waveform 




Typical EGG Derivative 




Fig. 9. Glottal openings and closures demarcated over two pitch periods 
of a typical vowel, shown with idealized glottal flow (top), speech (middle), 
and EGG derivative (bottom) waveforms as a function of time. 



opening detection has received relatively little attention in the 
literature [22], with preliminary results reported only in recent 
conference proceedings p8|, [39]. 



A. Physiology of Sub -Segmental Variations 

Figure [9] illustrates the idealized open and closed glottal 
phases associated with a typical vowel, along with the corre- 
sponding waveform and derivative electroglottograph (DEGG) 
data indicating approximate opening and closing instants |20|, 
[40 1 . In each pitch period, the glottal closure instant (GCI) is 
defined as the moment at which the vocal folds close, and 
marks the start of the closed phase — an interval during which 
no airflow volume velocity is measured at the glottis (top 
panel), and the acoustic output at the lips takes the form of 
exponentially-damped oscillations (middle panel). Nominally, 
the glottal opening instant (GOI) indicates the start of the open 
phase: the vocal folds gradually begin to open until airflow 
velocity reaches its maximum amplitude, after which they 
begin to close, leading to the next GCI. 

Time-invariance of the vocal tract suggests the use of 
linear prediction to estimate formant values during the closed 
phase (2), and then to use changes in these values across 
sliding windows to determine GOI and GCI locations [18|. 
Indeed, as the vocal folds begin to open at the GOI, the 
vocal tract gradually lengthens, resulting in a change in the 
frequency and bandwidth of the first formant pT) — an effect 
that can be explained by a source-filter model with a time- 
varying vocal tract. Furthermore, the assumption that short- 
term statistics of the speech signal undergo maximal change 
in the vicinity of a GCI implies that such regions will exhibit 
large linear-prediction errors. 

B. Detection of Glottal Opening Instants 

We first give a sequential algorithm to detect GOIs via the 
GLRT statistic T(x). To study the efficacy of the proposed 
method, we assume that the timings of the glottal closures 



are available, and use these to process each pitch period 
independently. In addition to evaluating the absolute error rates 
of our proposed scheme using recordings of sustained vowels, 
we also compare it with the method of |17| — a standard 



prediction-error-based approach that remains in wide use, and 



effectively underlies more recent approaches such as [38 1. 

1 ) Sequential GOI Detection Procedure: In contrast to the 
"merging" procedure of Algorithm [T] our basic approach here 
is to scan a sequence of short-time segments x w , induced 
by shifts of an A*o-sample rectangular window initially left- 
aligned with a glottal closure instant, until spectral change is 
detected via the GLRT statistic of Q. 

At each iteration, the window slides one sample to the right, 
and T(x w ) is evaluated; this procedure continues until T(x w ) 
exceeds a specified CFAR threshold 7, indicating that spectral 
change was detected, and signifying the beginning of the open 
phase. In this case, the GOI location is declared to be at the 
right edge of x w . On the other hand, a missed detection results 
if a GOI has not been identified by the time the right edge 
of the sliding window coincides with the next glottal closure 
instant. The exact procedure is summarized in Algorithm [2] 

Algorithm 2 Sequential Glottal Opening Instant Detector 

1) Initialization: input one pitch period of data x between 
two consecutive glottal closure locations gi and cj2 

• Set wi = <?i, w r = wi + No, and set 7 via ( fl"3j ) 

• Set x w = (x[iUi] • • • i[w r ]) 

2) While T(x w ) < 7 and w r < #2 

• Increment wi and w r (slide window to right) 

• Recompute x w and evaluate T(x w ) 

3) If w r < <?2, then return w r as the estimated glottal 
opening location, otherwise report a missed detection. 

Since each instantiation of Algorithm [2] is confined to a 
single pitch period, the parameters A*o, p, and q must be 
chosen carefully. To ensure robust estimates of the TVAR 
coefficients, the window length No cannot be too small; on the 
other hand, if it exceeds the length of the entire closed-phase 
region, then the GOI cannot be resolved. Likewise, choosing a 
small number of TVAR coefficients results in smeared spectral 
estimates, whereas using large values of p leads to high test 
statistic variance and a subsequent increase in false alarms; this 
same line of reasoning also leads us to keep q small. Thus, 
in all the experiments reported in Section VI-B we employ 



No = 50-sample windows, p = 4 TVAR coefficients and the 
first 2 Legendre polynomials as basis functions (q = 1). We 
also evaluated the robustness of our results with respect to 
these settings, and observed that using window lengths of 40- 
60 samples, 3-6 TVAR coefficients, and 2-4 basis functions 
also leads to reasonable results in practice. 

2) Evaluation: We next evaluated the ability of Algorithm|2] 
to identify the glottal opening instants in five sustained vowels 
uttered by a male speaker (109 Hz average F0), synchronously 
recorded with an EGG signal (Center for Laryngeal Surgery 
and Voice Rehabilitation, Massachusetts General Hospital), 
and subsequently downsampled to 16 kHz. The Speech Filing 
System |42) was used to extract DEGG peaks and dips, which 
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Fig. 10. Algorithm [5] applied to detect the GOI in a pitch period of the 
vowel /a/ (top left), shown together with its EGG derivative (bottom left). 
The sliding window is left-aligned with the GCI (solid black line); estimated 
AR coefficients (top right) and the GLRT statistic T(x) of l(5} (bottom right) 
are then computed for each subsequent window position. The detected GOI 
(dashed black line) corresponds to the location of the first determined change 
(at the 15% CFAR level) in vocal tract parameters. 

in turn provided a means of experimentally measured ground 
truth for our evaluations. 

A typical example of GOI detection is illustrated in Fig. [TO] 
which shows the results of applying Algorithm [2] to an excised 
segment of the vowel /a/. The detected GOI in this example 
was declared to be at the right edge of the first short- 
time segment x w for which T(x w ) exceed the 15% CFAR 
threshold 7, and is marked by a dashed black line in all 
four panels of Fig. [10] As can be seen in the bottom-right 
panel, the estimated GOI coincides precisely with a dip in the 
DEGG waveform. Moreover, as the top-right panel shows, this 
location corresponds to a significant change in the estimated 
coefficient trajectories, likely due both to a change in the 
frequency and bandwidth of the first formant (resulting from 
nonlinear source-filter interaction fl8] , |4T|), as well as an 
increase in airflow volume velocity (from zero) at the start of 
the open phase. 

Detection rates were then computed over 75 periods of 
each vowel, and detected GOIs were compared to DEGG 
dips in every pitch period that yielded a GOI detection. 
The resultant detection rates and root mean-square errors 
(RMSE, conditioned on successful detection) are reported in 
Table |H] along with a comparison to the prediction-error-based 
approach of fTTJ , which we now describe. 

TABLE II 

GOI Detection Accuracy (ms, No. missed detections). 





/a/ 


lei 


IV 


lol 


lul 


GLRT RMSE (ms) 


0.69 


1.03 


1.00 


1.15 


0.69 


WMG |17| RMSE (ms) 


1.04 


1.78 


1.13 


1.97 


1.10 


GLRTTvIissed Det. 
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WMG 1 17) Missed Det. 
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3) Comparison with approach of Wong, Market, and Gray 
(WMG) jf/7|/: The approach of fT7) involves first computing a 
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Fig. 1 1 . Comparison of Algorithm [2] and the approach of |17| for GOI 
detection. The EGG derivative for 8 periods of the vowel /a/, and estimated 
AR coefficients, are shown for all sliding window positions (top two panels) 
along with the associated values of T(x) and r)(x) (bottom two panels). True 
and estimated GOI locations are indicated by solid and dashed black lines, 
respectively. Note the variability in the dynamic range of rj(x) from one pitch 
period to the next, and the missed detection (2nd pitch period, bottom panel). 



normalized error measure rj(x w ) for each short-time segment 
x w (induced by a sliding window as in Algorithm |2j, and then 
identifying the GOI instant with the right edge of x w when 
a large increase in r)(x w ) is observed. The measure r)(x w ) 
is obtained by fitting a time-invariant AR(p) model to x w 
(using ( fTO) with q = 0), calculating the norm of the resultant 
prediction error, and normalizing by the energy of short-time 
segment x w . 

Figure [TT provides a comparison of this approach to that of 
Algorithm|2 over 8 periods of the vowel /a/. Here Algorithm [2] 
is implemented with a 15% CFAR level, but the threshold for 
rj(x) must be set manually, since no theoretical guidelines 
are available fT7) . Indeed, as illustrated in the bottom panel 
of Fig. 11 variability in the dynamic range of f](x) across 



pitch periods implies that any fixed threshold will necessarily 
introduce a tradeoff between detection rates and RMSE. In 
this example, lowering the threshold to intersect with 17(05) 
in the second pitch period — and thereby removing the missed 
detection — results in a 25% increase in RMSE. 

The denominator of the GLRT statistic T(x w ) depends on 
the same prediction error residual used to calculate rj(x w ); 



however, as indicated by Fig. 11 it remains much more stable 
across pitch periods. Thus, while the approach of ffTTJ relies 
on large absolute changes in AR residual energy to detect 
glottal openings, that of Algorithm [2] explicitly takes into 
account the ratio of AR to TVAR residual energies — resulting 
in improved overall performance. Indeed, though thresholds 
were set individually for each vowel of Table [II] and manually 
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adjusted to obtain the best RMSE performance while keeping 
the number of missed detections reasonably small, Algorithm[2] 
with a 15% CFAR threshold exhibits both superior detection 
rates and RMSE. 

C. Detection of Glottal Closure Instants 

Although our main focus here is on GOI detection, the 
GLRT statistic of (|6]l may also be employed to detect glottal 
closures. Indeed, under the assumption stated earlier that the 
speech signal undergoes locally maximal change in the vicinity 
of a GCI, a simple GCI detection algorithm immediately sug- 
gests itself: Compute (|6]l for every location of an appropriate 
sliding analysis window, and declare the glottal closure to 
occur at the midpoint of the window with the largest associated 
value of T(x). In this formulation, T(x) is being treated 
simply as a signal with features that may be helpful in finding 
the GCI locations; no test threshold need be set. A typical 
result is shown in the third panel of Fig. 12 obtained using 



the same parameter settings (50-sample window, p = 4, q = 2) 
as in the GOI detection scheme of Section [VI-BI 

We compared this method to two others based on linear 
prediction and group delay, as described above. First, we 
implemented the alternative likelihood-ratio epoch detection 
(LRED) approach of [32) , which tests for a single change 
in AR parameters. Second we used the "front end" of the 
popular DYPSA algorithm for GOI detection p2[ , comprising 
the generation of GCI candidates and their weighting by the 
ideal phase-slope function deviation cost as implemented in 
the Voicebox online toolbox |43| . Table [III] summarizes the 
GCI estimation results under the same conditions as reported 
in Table [TTJ All three methods are comparable in terms of 
accuracy, though the GLRT approach proposed here can be 
used — with the same parameter settings — for both GCI and 
GOI detection. 

table m 
GCI Detection Accuracy (ms). 



Vowel/ GCI RMSE 


/a/ 


Id 


lil 


lol 


lul 


GLRT (No = 50, p = 4,q 


= 2) 


0.47 


1.05 


0.73 


0.97 


1.03 


LRED [32] (N = 72,p = 


= 6) 


1.02 


0.69 


1.00 


0.65 


1.12 


DYPSA Front End (22)(ATq 


= 50) 


0.61 


0.68 


0.70 


0.79 


1.10 
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Fig. 12. Using the GLRT statistic of j6} to rind GCI locations in a segment 
of the vowel /a/. The speech waveform (top), the EGG derivative (middle) and 
the GLRT statistic (bottom) are overlaid with the true (solid, black line) and 
estimated (dashed, black line) glottal closure locations in each pitch period. 

to identify changes in the vocal tract configuration in speech 
data at different time scales. At the segmental level we demon- 
strated the sensitivity of the GLRT to vocal tract variations 
corresponding to formant changes, and at the sub-segmental 
scale, we used it to identify both glottal openings and closures. 

Methodological extensions include augmenting the TVAR 
model presented here to explicitly account for the quasi- 
periodic nature of the glottal flow (or its time derivative), 
and deriving a GLRT statistic corresponding to (|6]l in the 
case where only noisy waveform measurements are available. 
Important next steps in applying the hypothesis testing frame- 
work to practical speech analysis include further development 
of the glottal closure and opening detection schemes of the 
previous section, which were here applied independently in 
each pitch period. Incorporating the dynamic programming 
approach of J22) will likely serve to improve performance, as 
will incorporating the GLRT statistic as part of global frame- 
to-frame cost function in such a framework. 



Results from both our approach and the DYPSA front end 
can in turn be propagated across pitch periods (using, e.g., 
dynamic programming [22 1) to inform a broader class of 
group-delay methods fl9)7though we leave such a system- 
level comparison as the subject of future work. 

VII. Discussion 

The goal of this article has been to develop a statis- 
tical framework based on time-varying autoregressions for 
the detection of nonstationarity in speech waveforms. This 
generalization of linear prediction was shown to yield efficient 
fitting procedures, as well as a corresponding generalized 
likelihood ratio test. Our study of GLRT detection performance 
yielded several practical consequences for speech analysis. 
Incorporating these conclusions, we presented two algorithms 
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